Opal: In Vivo Based Preservation Framework for Locating Lost Web Pages
Authors
Terry L. Harrison (Old Dominion University, 2005; Director: Dr. Michael L. Nelson)
Abstract
We present Opal, a framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and can share their knowledge with other Opal servers using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures, which are then used to search for similar versions of the web page. Using OAI-PMH to facilitate inter-Opal learning extends the protocol in a novel manner. We present the architecture of the Opal framework, discuss a reference implementation, and present a quantitative analysis indicating that Opal could be deployed effectively.
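As a rough illustration of the signature step described above, the sketch below derives a small lexical signature from a cached copy by ranking its terms with TF-IDF; the weighting scheme, the five-term signature length, and all function and variable names are assumptions made for this example, not details of Opal's actual implementation.

```python
import math
import re
from collections import Counter

def lexical_signature(cached_html, background_df, corpus_size, k=5):
    """Rank the terms of a cached page copy and return the top k as a signature.

    cached_html   -- HTML of the cached copy found via a search engine or archive
    background_df -- hypothetical dict: term -> number of corpus documents containing it
    corpus_size   -- size of that background corpus
    The TF-IDF weighting and k=5 are assumptions for this sketch.
    """
    # Crude tag stripping and tokenization; a real system would parse the HTML properly.
    text = re.sub(r"<[^>]+>", " ", cached_html).lower()
    tf = Counter(re.findall(r"[a-z]{3,}", text))

    def weight(term):
        df = background_df.get(term, 1)          # treat unseen terms as rare
        return tf[term] * math.log(corpus_size / df)

    return sorted(tf, key=weight, reverse=True)[:k]

# The resulting terms would be joined into a query and submitted to a search
# engine to look for relocated or similar versions of the lost page.
```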
Similar resources
Research on discovering deep web entries
Ontology plays an important role in locating domain-specific Deep Web contents. This paper therefore presents WFF, a novel framework for efficiently locating domain-specific Deep Web databases based on focused crawling and ontology, built by constructing a Web Page Classifier (WPC), a Form Structure Classifier (FSC), and a Form Content Classifier (FCC) in a hierarchical fashion. Firstly, the WPC discovers potential...
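A minimal sketch of that hierarchical filtering idea (page classifier, then form structure classifier, then form content classifier) might look like the following; the wpc, fsc, and fcc callables and the page/form representation are hypothetical stand-ins, since the excerpt does not describe the classifiers' features or models.

```python
def locate_deep_web_entries(pages, wpc, fsc, fcc):
    """Hierarchical filtering: page classifier, then form structure, then form content.

    pages is assumed to be an iterable of dicts with a "forms" list; wpc, fsc and
    fcc are placeholder callables standing in for the paper's three classifiers.
    """
    entries = []
    for page in pages:
        if not wpc(page):                 # 1. is the page relevant to the domain?
            continue
        for form in page.get("forms", []):
            if not fsc(form):             # 2. does the form look like a query interface?
                continue
            if fcc(form):                 # 3. does the form's content match the domain?
                entries.append(form)
    return entries
```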
Navigating the World Wide Web
Navigation (colloquially known as "surfing") is the activity of following links and browsing web pages. It is a time-intensive activity that engages every web user seeking information. We often get "lost in hyperspace" when we lose the context in which we are browsing, giving rise to the infamous navigation problem. So, in this age of information overload, we need navigational assistance to help us...
The Automatic Extraction of Web Information Based on Regular Expression
Building on a search engine, this paper constructs a Web information retrieval matching and structure extraction model and realizes an algorithm for locating and automatically extracting Baidu news information from multiple web pages. By analyzing the result URLs to obtain a standard expression for them, and by analyzing the DOM tree structure of the web pages, the article designs the key-tag regular ...
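In the same spirit, a toy version of the two steps the abstract mentions (matching result URLs with a regular expression and walking the page DOM for the content) could look like this; the URL pattern, the choice of h3 as the headline tag, and the function names are illustrative assumptions rather than the paper's actual expressions.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for result links; the paper's actual URL expression,
# derived from Baidu result pages, is not given in the excerpt above.
RESULT_URL = re.compile(r'href="(https?://news\.example\.com/[^"]+)"')

class HeadlineGrabber(HTMLParser):
    """Toy DOM walk: collect text inside the tag assumed to carry headlines (h3)."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_headline = True
    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_headline = False
    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.titles.append(data.strip())

def extract_results(results_html):
    """Pair each matched result URL with a headline found in the page."""
    urls = RESULT_URL.findall(results_html)
    grabber = HeadlineGrabber()
    grabber.feed(results_html)
    return list(zip(urls, grabber.titles))
```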
Evaluating the Quality of the Web Pages of Research Institutes Affiliated with the Ministry of Science, Research and Technology Located in Tehran from the Users' Perspective
Especially in research centers, evaluating the quality of web pages from the clients' point of view plays a constructive role in their design and development, since it familiarizes web developers with the clients' perspective and assists them in designing client-oriented web sites for scientific and research environments. As a model for assessing the quality of web pages, "WebQual" attempts to provide...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download only the domain-specific web pages; such an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
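A best-first frontier is the usual way such a prioritized URL queue is realized; the sketch below keeps the frontier in a heap keyed by a relevance score, where relevance, fetch, and extract_links are placeholder callables because the truncated abstract does not say how the ordering is actually computed.

```python
import heapq

def focused_crawl(seed_urls, relevance, fetch, extract_links, budget=100):
    """Best-first crawl: always expand the most promising URL in the frontier.

    relevance, fetch and extract_links are placeholder callables; the truncated
    abstract does not specify how the URL queue is actually scored.
    """
    frontier = [(-1.0, url) for url in seed_urls]   # negate scores: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seed_urls)
    pages = []
    while frontier and len(pages) < budget:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        pages.append(page)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link, page), link))
    return pages
```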